Distance correlation is a new measure of dependence between random
vectors. Distance covariance and distance correlation are analogous
to product-moment covariance and correlation, but unlike the
classical definition of correlation, distance correlation is zero only if
the random vectors are independent. The empirical distance dependence
measures are based on certain Euclidean distances between
sample elements rather than sample moments, yet have a compact
representation analogous to the classical covariance and correlation.
Asymptotic properties and applications in testing independence are
discussed. Implementation of the test and Monte Carlo results are
also presented.

1. Introduction. Distance correlation provides a new approach to the
problem of testing the joint independence of random vectors. For all
distributions with finite first moments, distance correlation $\mathcal{R}$
generalizes the idea of correlation in two fundamental ways:

(i) $\mathcal{R}(X,Y)$ is defined for $X$ and $Y$ in arbitrary dimensions;

(ii) $\mathcal{R}(X,Y) = 0$ characterizes independence of $X$ and $Y$.

Distance correlation has properties of a true dependence measure, analogous
to product-moment correlation $\rho$. Distance correlation satisfies
$0 \le \mathcal{R} \le 1$, and $\mathcal{R} = 0$ only if $X$ and $Y$ are
independent. In the bivariate normal case, $\mathcal{R}$ is a function of
$\rho$, and $\mathcal{R}(X,Y) \le |\rho(X,Y)|$ with equality when $\rho = \pm 1$.

Throughout this paper $X$ in $\mathbb{R}^p$ and $Y$ in $\mathbb{R}^q$ are
random vectors, where $p$ and $q$ are positive integers. The characteristic
functions of $X$ and $Y$ are denoted $f_X$ and $f_Y$, respectively, and the
joint characteristic function of $X$ and $Y$ is denoted $f_{X,Y}$. Distance
covariance $\mathcal{V}$ can be applied to measure the distance
$\|f_{X,Y}(t,s) - f_X(t) f_Y(s)\|$ between the joint characteristic function
and the product of the marginal characteristic functions (the norm
$\|\cdot\|$ is defined in Section 2), and to test the hypothesis of independence

\[ H_0 : f_{X,Y} = f_X f_Y \quad \text{vs.} \quad H_1 : f_{X,Y} \ne f_X f_Y. \]

The importance of the independence assumption for inference arises, for
example, in clinical studies with the case-only design, which uses only diseased
subjects assumed to be independent in the study population. In this design,
inferences on multiplicative gene interactions (see [1]) can be highly distorted
when there is a departure from independence. Classical methods such as the
Wilks Lambda [14] or Puri-Sen [8] likelihood ratio tests are not applicable if
the dimension exceeds the sample size, or when distributional assumptions
do not hold (see, e.g., [7] regarding the prevalence of nonnormality in
biology and ecology). A further limitation of multivariate extensions of methods
based on ranks is that they are ineffective for testing nonmonotone types of
dependence.

We propose an omnibus test of independence that is easily implemented
in arbitrary dimension. In our Monte Carlo results the distance covariance
test exhibits superior power against nonmonotone types of dependence while
maintaining good power performance in the multivariate normal case
relative to the parametric likelihood ratio test. Distance correlation can also be
applied as an index of dependence; for example, in meta-analysis [12]
distance correlation would be a more generally applicable index than
product-moment correlation, without requiring normality for valid inferences.

Theoretical properties of distance covariance and correlation are covered
in Section 2, extensions in Section 3, and results for the bivariate normal
case in Section 4. Empirical results are presented in Section 5, followed by
a summary in Section 6.

2. Theoretical properties of distance dependence measures.

Notation. The scalar product of vectors $t$ and $s$ is denoted by
$\langle t, s \rangle$. For complex-valued functions $f(\cdot)$, the complex
conjugate of $f$ is denoted by $\bar f$ and $|f|^2 = f \bar f$. The Euclidean
norm of $x$ in $\mathbb{R}^p$ is $|x|_p$. A sample from the distribution of
$X$ in $\mathbb{R}^p$ is denoted by the $n \times p$ matrix $\mathbf{X}$, and
the sample vectors (rows) are labeled $X_1, \dots, X_n$. A primed variable
$X'$ is an independent copy of $X$; that is, $X$ and $X'$ are independent and
identically distributed.

Definition 1. For complex functions $\gamma$ defined on
$\mathbb{R}^p \times \mathbb{R}^q$ the $\|\cdot\|_w$-norm in the weighted
$L_2$ space of functions on $\mathbb{R}^{p+q}$ is defined by

\[ \|\gamma(t,s)\|_w^2 = \int_{\mathbb{R}^{p+q}} |\gamma(t,s)|^2 \, w(t,s) \, dt \, ds, \tag{2.1} \]

where $w(t,s)$ is an arbitrary positive weight function for which the integral
above exists.

2.1. Choice of weight function. Using the $\|\cdot\|_w$-norm (2.1) with a
suitable choice of weight function $w(t,s)$, we define a measure of dependence

\[ \mathcal{V}^2(X,Y;w) = \|f_{X,Y}(t,s) - f_X(t) f_Y(s)\|_w^2
   = \int_{\mathbb{R}^{p+q}} |f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2 \, w(t,s) \, dt \, ds, \tag{2.2} \]

such that $\mathcal{V}^2(X,Y;w) = 0$ if and only if $X$ and $Y$ are independent.
In this paper $\mathcal{V}$ will be analogous to the absolute value of the
classical product-moment covariance. If we divide $\mathcal{V}(X,Y;w)$ by
$\sqrt{\mathcal{V}(X;w)\,\mathcal{V}(Y;w)}$, where

\[ \mathcal{V}^2(X;w) = \int_{\mathbb{R}^{2p}} |f_{X,X}(t,s) - f_X(t) f_X(s)|^2 \, w(t,s) \, dt \, ds, \]

we have a type of unsigned correlation $\mathcal{R}_w$.

Not every weight function leads to an "interesting" $\mathcal{R}_w$, however.
The coefficient $\mathcal{R}_w$ should be scale invariant, that is, invariant
with respect to transformations $(X,Y) \mapsto (\epsilon X, \epsilon Y)$, for
$\epsilon > 0$. We also require that $\mathcal{R}_w$ is positive for dependent
variables. It is easy to check that if the weight function $w(t,s)$ is
integrable, and both $X$ and $Y$ have finite variance, then the Taylor
expansions of the underlying characteristic functions show that

\[ \lim_{\epsilon \to 0} \frac{\mathcal{V}^2(\epsilon X, \epsilon Y; w)}
   {\sqrt{\mathcal{V}^2(\epsilon X; w)\,\mathcal{V}^2(\epsilon Y; w)}} = \rho^2(X,Y). \]

Thus for integrable $w$, if $\rho = 0$, then $\mathcal{R}_w$ can be arbitrarily
close to zero even if $X$ and $Y$ are dependent. However, by applying a
nonintegrable weight function, we obtain an $\mathcal{R}_w$ that is scale
invariant and cannot be zero for dependent $X$ and $Y$. We do not claim that
our choice for $w$ is the only reasonable one, but it will become clear in the
following sections that our choice (2.4) results in very simple and applicable
empirical formulas. (A more complicated weight function is applied in [2],
which leads to a more computationally difficult statistic and does not have
the interesting correlation form.)

The crucial observation is the following lemma. 

Lemma 1. If $0 < \alpha < 2$, then for all $x$ in $\mathbb{R}^d$

\[ \int_{\mathbb{R}^d} \frac{1 - \cos\langle t, x \rangle}{|t|_d^{d+\alpha}} \, dt = C(d,\alpha)\,|x|_d^{\alpha}, \]

where

\[ C(d,\alpha) = \frac{2\pi^{d/2}\,\Gamma(1 - \alpha/2)}{\alpha\, 2^{\alpha}\, \Gamma((d+\alpha)/2)} \]

and $\Gamma(\cdot)$ is the complete gamma function. The integrals at $0$ and
$\infty$ are meant in the principal value sense:
$\lim_{\epsilon \to 0} \int_{\mathbb{R}^d \setminus \{\epsilon B + \epsilon^{-1} B^c\}}$,
where $B$ is the unit ball (centered at 0) in $\mathbb{R}^d$ and $B^c$ is the
complement of $B$.

See [11] for the proof of Lemma 1. In the simplest case, $\alpha = 1$, the
constant in Lemma 1 is

\[ c_d = C(d,1) = \frac{\pi^{(1+d)/2}}{\Gamma((1+d)/2)}. \tag{2.3} \]
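As a quick sanity check on Lemma 1 and (2.3), the constant can be evaluated numerically. The sketch below is illustrative only: the function names `lemma1_constant` and `c_d` are ours and do not appear in the paper.

```python
import math

def lemma1_constant(d, alpha):
    """C(d, alpha) from Lemma 1 (illustrative helper); needs 0 < alpha < 2."""
    if not 0 < alpha < 2:
        raise ValueError("Lemma 1 requires 0 < alpha < 2")
    num = 2 * math.pi ** (d / 2) * math.gamma(1 - alpha / 2)
    den = alpha * 2 ** alpha * math.gamma((d + alpha) / 2)
    return num / den

def c_d(d):
    """Closed form (2.3), the special case alpha = 1."""
    return math.pi ** ((1 + d) / 2) / math.gamma((1 + d) / 2)

# The general constant reduces to (2.3) when alpha = 1.
for d in (1, 2, 3, 10):
    assert math.isclose(lemma1_constant(d, 1), c_d(d), rel_tol=1e-12)
```

For $d = 1$ this gives $c_1 = \pi$, which is the factor behind the weight $\pi^{-2} t^{-2} s^{-2}$ used for $p = q = 1$ in the proof of Theorem 1; similarly $c_2 = 2\pi$ and $c_3 = \pi^2$.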

In view of Lemma 1, it is natural to choose the weight function

\[ w(t,s) = \left( c_p c_q \, |t|_p^{1+p} \, |s|_q^{1+q} \right)^{-1}, \tag{2.4} \]

corresponding to $\alpha = 1$. We apply the weight function (2.4) and the
corresponding weighted $L_2$ norm $\|\cdot\|$, omitting the index $w$, and
write the dependence measure (2.2) as $\mathcal{V}^2(X,Y)$. In integrals we
also use the symbol $d\omega$, which is defined by

\[ d\omega = \left( c_p c_q \, |t|_p^{1+p} \, |s|_q^{1+q} \right)^{-1} dt \, ds. \]

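For finiteness of $\|f_{X,Y}(t,s) - f_X(t) f_Y(s)\|^2$ it is sufficient that
$E|X|_p < \infty$ and $E|Y|_q < \infty$. By the Cauchy-Bunyakovsky inequality

\[ \begin{aligned}
|f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2
  &= \left| E\big(e^{i\langle t,X\rangle} - f_X(t)\big)\big(e^{i\langle s,Y\rangle} - f_Y(s)\big) \right|^2 \\
  &\le E\big|e^{i\langle t,X\rangle} - f_X(t)\big|^2 \; E\big|e^{i\langle s,Y\rangle} - f_Y(s)\big|^2 \\
  &= (1 - |f_X(t)|^2)(1 - |f_Y(s)|^2).
\end{aligned} \]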
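The bound above holds for any joint distribution, including the empirical distribution of a sample, so it can be checked numerically by replacing each expectation with a sample mean. The following sketch does this for $p = q = 1$; the simulated model and the helper name `cf_gap_and_bound` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def cf_gap_and_bound(x, y, t, s):
    """Return (|f_XY - f_X f_Y|^2, (1 - |f_X|^2)(1 - |f_Y|^2)) at (t, s),
    using empirical characteristic functions of the 1-d samples x and y."""
    ex = np.exp(1j * t * x)
    ey = np.exp(1j * s * y)
    f_x, f_y, f_xy = ex.mean(), ey.mean(), (ex * ey).mean()
    gap = abs(f_xy - f_x * f_y) ** 2
    bound = (1 - abs(f_x) ** 2) * (1 - abs(f_y) ** 2)
    return gap, bound

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x ** 2 + rng.normal(size=200)      # dependent on x, but not linearly

# Evaluate the inequality on a small grid of (t, s) values.
grid = np.linspace(-3, 3, 13)
checks = [cf_gap_and_bound(x, y, t, s) for t in grid for s in grid]
assert all(gap <= bound + 1e-12 for gap, bound in checks)
```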

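If $E(|X|_p + |Y|_q) < \infty$, then by Lemma 1 and by Fubini's theorem it
follows that

\[ \begin{aligned}
\int_{\mathbb{R}^{p+q}} |f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2 \, d\omega
  &\le \int_{\mathbb{R}^p} \frac{1 - |f_X(t)|^2}{c_p |t|_p^{1+p}} \, dt
       \int_{\mathbb{R}^q} \frac{1 - |f_Y(s)|^2}{c_q |s|_q^{1+q}} \, ds \\
  &= E\left[ \int_{\mathbb{R}^p} \frac{1 - \cos\langle t, X - X' \rangle}{c_p |t|_p^{1+p}} \, dt \right]
     \cdot E\left[ \int_{\mathbb{R}^q} \frac{1 - \cos\langle s, Y - Y' \rangle}{c_q |s|_q^{1+q}} \, ds \right] \\
  &= E|X - X'|_p \, E|Y - Y'|_q < \infty.
\end{aligned} \tag{2.5} \]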

Thus we have the following definitions. 
Definition 2 (Distance covariance). The distance covariance (dCov)
between random vectors $X$ and $Y$ with finite first moments is the
nonnegative number $\mathcal{V}(X,Y)$ defined by

\[ \mathcal{V}^2(X,Y) = \|f_{X,Y}(t,s) - f_X(t) f_Y(s)\|^2
   = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}}
     \frac{|f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2}{|t|_p^{1+p} \, |s|_q^{1+q}} \, dt \, ds. \tag{2.6} \]

Similarly, distance variance (dVar) is defined as the square root of

\[ \mathcal{V}^2(X) = \mathcal{V}^2(X,X) = \|f_{X,X}(t,s) - f_X(t) f_X(s)\|^2. \]



Remark 1. If $E(|X|_p + |Y|_q) = \infty$ but
$E(|X|_p^{\alpha} + |Y|_q^{\alpha}) < \infty$ for some $0 < \alpha < 1$, then
one can apply $\mathcal{V}^{(\alpha)}$ and $\mathcal{R}^{(\alpha)}$ (see
Section 3.1); otherwise one can apply a suitable transformation of $(X,Y)$
into bounded random variables $(\tilde X, \tilde Y)$ such that $\tilde X$ and
$\tilde Y$ are independent if and only if $X$ and $Y$ are independent.

Definition 3 (Distance correlation). The distance correlation (dCor)
between random vectors $X$ and $Y$ with finite first moments is the
nonnegative number $\mathcal{R}(X,Y)$ defined by

\[ \mathcal{R}^2(X,Y) =
   \begin{cases}
     \dfrac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X)\,\mathcal{V}^2(Y)}}, & \mathcal{V}^2(X)\,\mathcal{V}^2(Y) > 0, \\[2ex]
     0, & \mathcal{V}^2(X)\,\mathcal{V}^2(Y) = 0.
   \end{cases} \tag{2.7} \]

Clearly the definition of $\mathcal{R}$ in (2.7) suggests an analogy with the
product-moment correlation coefficient $\rho$. Analogous properties are
established in Theorem 3. The relation between $\mathcal{V}$, $\mathcal{R}$
and $\rho$ in the bivariate normal case will be established in Theorem 7.

The distance dependence statistics are defined as follows. For an observed
random sample $(\mathbf{X}, \mathbf{Y}) = \{(X_k, Y_k) : k = 1, \dots, n\}$
from the joint distribution of random vectors $X$ in $\mathbb{R}^p$ and $Y$
in $\mathbb{R}^q$, define

\[ a_{kl} = |X_k - X_l|_p, \qquad
   \bar a_{k\cdot} = \frac{1}{n} \sum_{l=1}^n a_{kl}, \qquad
   \bar a_{\cdot l} = \frac{1}{n} \sum_{k=1}^n a_{kl}, \]

\[ \bar a_{\cdot\cdot} = \frac{1}{n^2} \sum_{k,l=1}^n a_{kl}, \qquad
   A_{kl} = a_{kl} - \bar a_{k\cdot} - \bar a_{\cdot l} + \bar a_{\cdot\cdot}, \]

$k, l = 1, \dots, n$. Similarly, define $b_{kl} = |Y_k - Y_l|_q$ and
$B_{kl} = b_{kl} - \bar b_{k\cdot} - \bar b_{\cdot l} + \bar b_{\cdot\cdot}$,
for $k, l = 1, \dots, n$.

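Definition 4. The empirical distance covariance
$\mathcal{V}_n(\mathbf{X}, \mathbf{Y})$ is the nonnegative number defined by

\[ \mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) = \frac{1}{n^2} \sum_{k,l=1}^n A_{kl} B_{kl}. \tag{2.8} \]

Similarly, $\mathcal{V}_n(\mathbf{X})$ is the nonnegative number defined by

\[ \mathcal{V}_n^2(\mathbf{X}) = \mathcal{V}_n^2(\mathbf{X}, \mathbf{X}) = \frac{1}{n^2} \sum_{k,l=1}^n A_{kl}^2. \tag{2.9} \]

Although it may not be immediately obvious that
$\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) \ge 0$, this fact as well as the
motivation for the definition of $\mathcal{V}_n$ will be clear from Theorem
1 below.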

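Definition 5. The empirical distance correlation
$\mathcal{R}_n(\mathbf{X}, \mathbf{Y})$ is the square root of

\[ \mathcal{R}_n^2(\mathbf{X}, \mathbf{Y}) =
   \begin{cases}
     \dfrac{\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y})}{\sqrt{\mathcal{V}_n^2(\mathbf{X})\,\mathcal{V}_n^2(\mathbf{Y})}}, & \mathcal{V}_n^2(\mathbf{X})\,\mathcal{V}_n^2(\mathbf{Y}) > 0, \\[2ex]
     0, & \mathcal{V}_n^2(\mathbf{X})\,\mathcal{V}_n^2(\mathbf{Y}) = 0.
   \end{cases} \tag{2.10} \]

Remark 2. The statistic $\mathcal{V}_n(\mathbf{X}) = 0$ if and only if every
sample observation is identical. Indeed, if $\mathcal{V}_n(\mathbf{X}) = 0$,
then $A_{kl} = 0$ for $k, l = 1, \dots, n$. Thus
$0 = A_{kk} = -\bar a_{k\cdot} - \bar a_{\cdot k} + \bar a_{\cdot\cdot}$
implies that $\bar a_{k\cdot} = \bar a_{\cdot k} = \bar a_{\cdot\cdot}/2$, and

\[ 0 = A_{kl} = a_{kl} - \bar a_{k\cdot} - \bar a_{\cdot l} + \bar a_{\cdot\cdot} = a_{kl} = |X_k - X_l|_p, \]

so $X_1 = \cdots = X_n$.

It is clear that $\mathcal{R}_n$ is easy to compute, and in the following
sections it will be shown that $\mathcal{R}_n$ is a good empirical measure of
dependence.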
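Definitions 4 and 5 translate almost line by line into array code. The following is a minimal NumPy sketch rather than a reference implementation: the function names are ours, the $n \times n$ distance matrices are formed explicitly (so memory grows as $n^2$), and a tiny negative value caused by floating-point rounding is clipped to zero before taking square roots.

```python
import numpy as np

def centered_distances(Z):
    """Double-centered matrix A_kl = a_kl - abar_k. - abar_.l + abar_.."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:                      # treat a 1-d array as n points in R^1
        Z = Z[:, None]
    a = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True) + a.mean())

def dcov2(X, Y):
    """Squared empirical distance covariance, equation (2.8)."""
    return float((centered_distances(X) * centered_distances(Y)).mean())

def dcor(X, Y):
    """Empirical distance correlation, the square root of (2.10)."""
    denom = dcov2(X, X) * dcov2(Y, Y)
    if denom <= 0:
        return 0.0
    return float(np.sqrt(max(dcov2(X, Y), 0.0) / np.sqrt(denom)))

rng = np.random.default_rng(1)
x = rng.normal(size=300)
print(dcor(x, x ** 2))                   # nonmonotone dependence
print(dcor(x, rng.normal(size=300)))     # independent samples
```

In this simulated example the first value should be clearly larger than the second, even though the product-moment correlation between `x` and `x ** 2` is near zero.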

2.3. Properties of distance covariance. It would have been natural, but
less elementary, to define $\mathcal{V}_n(\mathbf{X}, \mathbf{Y})$ as
$\|f^n_{X,Y}(t,s) - f^n_X(t) f^n_Y(s)\|$, where

\[ f^n_{X,Y}(t,s) = \frac{1}{n} \sum_{k=1}^n \exp\{ i\langle t, X_k \rangle + i\langle s, Y_k \rangle \} \]

is the empirical characteristic function of the sample,
$\{(X_1, Y_1), \dots, (X_n, Y_n)\}$, and

\[ f^n_X(t) = \frac{1}{n} \sum_{k=1}^n \exp\{ i\langle t, X_k \rangle \}, \qquad
   f^n_Y(s) = \frac{1}{n} \sum_{k=1}^n \exp\{ i\langle s, Y_k \rangle \} \]

are the marginal empirical characteristic functions of the $\mathbf{X}$
sample and $\mathbf{Y}$ sample, respectively. Our first theorem shows that
the two definitions are equivalent.

Theorem 1. If $(\mathbf{X}, \mathbf{Y})$ is a sample from the joint
distribution of $(X, Y)$, then

\[ \mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) = \|f^n_{X,Y}(t,s) - f^n_X(t) f^n_Y(s)\|^2. \]

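Proof. Lemma 1 implies that there exist constants $c_p$ and $c_q$ such that
for all $X$ in $\mathbb{R}^p$, $Y$ in $\mathbb{R}^q$,

\[ \int_{\mathbb{R}^p} \frac{1 - \exp\{ i\langle t, X \rangle \}}{|t|_p^{1+p}} \, dt = c_p |X|_p, \tag{2.11} \]

\[ \int_{\mathbb{R}^q} \frac{1 - \exp\{ i\langle s, Y \rangle \}}{|s|_q^{1+q}} \, ds = c_q |Y|_q, \tag{2.12} \]

\[ \int_{\mathbb{R}^p} \int_{\mathbb{R}^q}
   \frac{1 - \exp\{ i\langle t, X \rangle + i\langle s, Y \rangle \}}{|t|_p^{1+p} \, |s|_q^{1+q}} \, dt \, ds
   = c_p c_q |X|_p |Y|_q, \tag{2.13} \]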

where the integrals are understood in the principal value sense. For
simplicity, consider the case $p = q = 1$. The distance between the empirical
characteristic functions in the weighted norm
$w(t,s) = \pi^{-2} t^{-2} s^{-2}$ involves $|f^n_{X,Y}(t,s)|^2$,
$|f^n_X(t) f^n_Y(s)|^2$ and $\overline{f^n_{X,Y}(t,s)}\, f^n_X(t) f^n_Y(s)$.
For the first we have

\[ f^n_{X,Y}(t,s)\, \overline{f^n_{X,Y}(t,s)}
   = \frac{1}{n^2} \sum_{k,l=1}^n \cos(X_k - X_l)t \, \cos(Y_k - Y_l)s + V_1, \]

where $V_1$ represents terms that vanish when the integral
$\|f^n_{X,Y}(t,s) - f^n_X(t) f^n_Y(s)\|^2$ is evaluated. The second
expression is

\[ f^n_X(t) f^n_Y(s)\, \overline{f^n_X(t) f^n_Y(s)}
   = \frac{1}{n^2} \sum_{k,l=1}^n \cos(X_k - X_l)t \;
     \frac{1}{n^2} \sum_{k,l=1}^n \cos(Y_k - Y_l)s + V_2 \]

and the third is

\[ f^n_{X,Y}(t,s)\, \overline{f^n_X(t) f^n_Y(s)}
   = \frac{1}{n^3} \sum_{k,l,m=1}^n \cos(X_k - X_l)t \, \cos(Y_k - Y_m)s + V_3, \]

where $V_2$ and $V_3$ represent terms that vanish when the integral is
evaluated. To evaluate the integral
$\|f^n_{X,Y}(t,s) - f^n_X(t) f^n_Y(s)\|^2$, apply Lemma 1, and statements
(2.11), (2.12) and (2.13) using

\[ \cos u \cos v = 1 - (1 - \cos u) - (1 - \cos v) + (1 - \cos u)(1 - \cos v). \]

After cancellation in the numerator of the integrand it remains to evaluate
integrals of the type

\[ \begin{aligned}
\int_{\mathbb{R}^2} & (1 - \cos(X_k - X_l)t)(1 - \cos(Y_k - Y_l)s) \, \frac{dt}{t^2} \frac{ds}{s^2} \\
  &= \int_{\mathbb{R}} (1 - \cos(X_k - X_l)t) \, \frac{dt}{t^2}
     \times \int_{\mathbb{R}} (1 - \cos(Y_k - Y_l)s) \, \frac{ds}{s^2} \\
  &= c_1^2 \, |X_k - X_l| \, |Y_k - Y_l|.
\end{aligned} \]

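For random vectors $X$ in $\mathbb{R}^p$ and $Y$ in $\mathbb{R}^q$, the same
steps are applied using
$w(t,s) = \{ c_p c_q |t|_p^{1+p} |s|_q^{1+q} \}^{-1}$. Thus

\[ \|f^n_{X,Y}(t,s) - f^n_X(t) f^n_Y(s)\|^2 = S_1 + S_2 - 2 S_3, \tag{2.14} \]

where

\[ S_1 = \frac{1}{n^2} \sum_{k,l=1}^n |X_k - X_l|_p \, |Y_k - Y_l|_q, \tag{2.15} \]

\[ S_2 = \frac{1}{n^2} \sum_{k,l=1}^n |X_k - X_l|_p \; \frac{1}{n^2} \sum_{k,l=1}^n |Y_k - Y_l|_q, \tag{2.16} \]

\[ S_3 = \frac{1}{n^3} \sum_{k=1}^n \sum_{l,m=1}^n |X_k - X_l|_p \, |Y_k - Y_m|_q. \tag{2.17} \]

To complete the proof we need to verify the algebraic identity

\[ \mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) = S_1 + S_2 - 2 S_3. \tag{2.18} \]

For the proof of (2.18) see the Appendix. Then (2.14) and (2.18) imply that

\[ \mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) = \|f^n_{X,Y}(t,s) - f^n_X(t) f^n_Y(s)\|^2. \qquad \Box \]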
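Identity (2.18) is purely algebraic, so it can be spot-checked numerically before consulting the Appendix. The sketch below compares the double-centered form (2.8) with $S_1 + S_2 - 2S_3$ from (2.15) to (2.17) on simulated data; the helper names and the simulated model are illustrative assumptions.

```python
import numpy as np

def pairwise(Z):
    """Matrix of Euclidean distances between the rows of Z."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

def dcov2_centered(X, Y):
    """V_n^2 via double centering, equation (2.8)."""
    def center(a):
        return (a - a.mean(axis=1, keepdims=True)
                  - a.mean(axis=0, keepdims=True) + a.mean())
    return float((center(pairwise(X)) * center(pairwise(Y))).mean())

def dcov2_sums(X, Y):
    """V_n^2 via S1 + S2 - 2*S3, equations (2.15)-(2.17)."""
    a, b = pairwise(X), pairwise(Y)
    n = a.shape[0]
    s1 = (a * b).sum() / n ** 2
    s2 = (a.sum() / n ** 2) * (b.sum() / n ** 2)
    # Sum over k of (sum_l a_kl) * (sum_m b_km), divided by n^3.
    s3 = (a.sum(axis=1) * b.sum(axis=1)).sum() / n ** 3
    return float(s1 + s2 - 2 * s3)

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
Y = np.column_stack([X[:, 0] ** 2, rng.normal(size=40)])
assert np.isclose(dcov2_centered(X, Y), dcov2_sums(X, Y))
```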

Theorem 2. If $E|X|_p < \infty$ and $E|Y|_q < \infty$, then almost surely

\[ \lim_{n \to \infty} \mathcal{V}_n(\mathbf{X}, \mathbf{Y}) = \mathcal{V}(X, Y). \tag{2.19} \]

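Proof. Define

\[ \xi_n(t,s) = \frac{1}{n} \sum_{k=1}^n e^{i\langle t, X_k \rangle + i\langle s, Y_k \rangle}
   - \frac{1}{n} \sum_{k=1}^n e^{i\langle t, X_k \rangle} \, \frac{1}{n} \sum_{k=1}^n e^{i\langle s, Y_k \rangle}, \]

so that $\mathcal{V}_n^2 = \|\xi_n(t,s)\|^2$. Then after elementary
transformations

\[ \xi_n(t,s) = \frac{1}{n} \sum_{k=1}^n u_k v_k - \frac{1}{n} \sum_{k=1}^n u_k \, \frac{1}{n} \sum_{k=1}^n v_k, \]

where $u_k = \exp(i\langle t, X_k \rangle) - f_X(t)$ and
$v_k = \exp(i\langle s, Y_k \rangle) - f_Y(s)$.
For each $\delta > 0$ define the region

\[ D(\delta) = \{ (t,s) : \delta \le |t|_p \le 1/\delta, \ \delta \le |s|_q \le 1/\delta \} \tag{2.20} \]

and random variables

\[ \mathcal{V}_{n,\delta}^2 = \int_{D(\delta)} |\xi_n(t,s)|^2 \, d\omega. \]

For any fixed $\delta > 0$, the weight function $w(t,s)$ is bounded on
$D(\delta)$. Hence $\mathcal{V}_{n,\delta}^2$ is a combination of
$V$-statistics of bounded random variables. For each $\delta > 0$ by the
strong law of large numbers (SLLN) for $V$-statistics, it follows that almost
surely

\[ \lim_{n \to \infty} \mathcal{V}_{n,\delta}^2 = \mathcal{V}_{\cdot,\delta}^2
   = \int_{D(\delta)} |f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2 \, d\omega. \]

Clearly $\mathcal{V}_{\cdot,\delta}^2$ converges to $\mathcal{V}^2$ as
$\delta$ tends to zero. Now it remains to prove that almost surely

\[ \limsup_{\delta \to 0} \, \limsup_{n \to \infty} \, |\mathcal{V}_{n,\delta}^2 - \mathcal{V}_n^2| = 0. \tag{2.21} \]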



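For each $\delta > 0$

\[ \begin{aligned}
|\mathcal{V}_{n,\delta}^2 - \mathcal{V}_n^2|
  &\le \int_{|t|_p < \delta} |\xi_n(t,s)|^2 \, d\omega + \int_{|t|_p > 1/\delta} |\xi_n(t,s)|^2 \, d\omega \\
  &\quad + \int_{|s|_q < \delta} |\xi_n(t,s)|^2 \, d\omega + \int_{|s|_q > 1/\delta} |\xi_n(t,s)|^2 \, d\omega.
\end{aligned} \tag{2.22} \]

For $z = (z_1, z_2, \dots, z_p)$ in $\mathbb{R}^p$ define the function

\[ G(y) = \int_{|z| < y} \frac{1 - \cos z_1}{|z|^{1+p}} \, dz. \]

Clearly $G(y)$ is bounded by $c_p$ and $\lim_{y \to 0} G(y) = 0$. Applying the
inequality $|x + y|^2 \le 2|x|^2 + 2|y|^2$ and the Cauchy-Bunyakovsky
inequality for sums, one can obtain that

\[ |\xi_n(t,s)|^2
   \le 2 \left| \frac{1}{n} \sum_{k=1}^n u_k v_k \right|^2
     + 2 \left| \frac{1}{n} \sum_{k=1}^n u_k \, \frac{1}{n} \sum_{k=1}^n v_k \right|^2
   \le \frac{4}{n} \sum_{k=1}^n |u_k|^2 \, \frac{1}{n} \sum_{k=1}^n |v_k|^2. \tag{2.23} \]

Hence the first summand in (2.22) satisfies

\[ \int_{|t|_p < \delta} |\xi_n(t,s)|^2 \, d\omega
   \le \frac{4}{n} \sum_{k=1}^n \int_{|t|_p < \delta} \frac{|u_k|^2 \, dt}{c_p |t|_p^{1+p}}
       \; \frac{1}{n} \sum_{k=1}^n \int_{\mathbb{R}^q} \frac{|v_k|^2 \, ds}{c_q |s|_q^{1+q}}. \tag{2.24} \]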
Here $|v_k|^2 = 1 + |f_Y(s)|^2 - e^{i\langle s, Y_k \rangle}\, \overline{f_Y(s)} - e^{-i\langle s, Y_k \rangle} f_Y(s)$, thus

\[ \int_{\mathbb{R}^q} \frac{|v_k|^2 \, ds}{c_q |s|_q^{1+q}}
   = (2 E_Y |Y_k - Y| - E|Y - Y'|) \le 2(|Y_k| + E|Y|), \]

where the expectation $E_Y$ is taken with respect to $Y$, and
$Y' \overset{D}{=} Y$ is independent of $Y_k$. Further, after a suitable
change of variables

\[ \begin{aligned}
\int_{|t|_p < \delta} \frac{|u_k|^2 \, dt}{c_p |t|_p^{1+p}}
  &= 2 E_X |X_k - X| \, G(|X_k - X| \delta) - E|X - X'| \, G(|X - X'| \delta) \\
  &\le 2 E_X |X_k - X| \, G(|X_k - X| \delta),
\end{aligned} \]

where the expectation $E_X$ is taken with respect to $X$, and
$X' \overset{D}{=} X$ is independent of $X_k$. Therefore, from (2.24)

\[ \int_{|t|_p < \delta} |\xi_n(t,s)|^2 \, d\omega
   \le 4 \, \frac{2}{n} \sum_{k=1}^n (|Y_k| + E|Y|) \;
          \frac{2}{n} \sum_{k=1}^n E_X |X_k - X| \, G(|X_k - X| \delta). \]

By the SLLN

\[ \limsup_{n \to \infty} \int_{|t|_p < \delta} |\xi_n(t,s)|^2 \, d\omega
   \le 4 \cdot 2 \cdot 2E|Y| \cdot 2 E|X_1 - X_2| \, G(|X_1 - X_2| \delta) \]

almost surely. Therefore by the Lebesgue bounded convergence theorem for
integrals and expectations

\[ \limsup_{\delta \to 0} \, \limsup_{n \to \infty} \int_{|t|_p < \delta} |\xi_n(t,s)|^2 \, d\omega = 0 \]

almost surely.

Consider now the second summand in (2.22). Inequalities (2.23) imply
that $|u_k|^2 \le 4$ and $\frac{1}{n} \sum_{k=1}^n |u_k|^2 \le 4$, hence

\[ \begin{aligned}
\int_{|t|_p > 1/\delta} |\xi_n(t,s)|^2 \, d\omega
  &\le 16 \int_{|t|_p > 1/\delta} \frac{dt}{c_p |t|_p^{1+p}}
       \int_{\mathbb{R}^q} \frac{1}{n} \sum_{k=1}^n \frac{|v_k|^2 \, ds}{c_q |s|_q^{1+q}} \\
  &\le 16 \delta \, \frac{2}{n} \sum_{k=1}^n (|Y_k| + E|Y|).
\end{aligned} \]

Thus, almost surely

\[ \limsup_{\delta \to 0} \, \limsup_{n \to \infty} \int_{|t|_p > 1/\delta} |\xi_n(t,s)|^2 \, d\omega = 0. \]

One can apply a similar argument to the remaining summands in (2.22) to
obtain (2.21). □
Corollary 1. If $E(|X|_p + |Y|_q) < \infty$, then almost surely,

\[ \lim_{n \to \infty} \mathcal{R}_n^2(\mathbf{X}, \mathbf{Y}) = \mathcal{R}^2(X, Y). \]

The definition of dCor suggests that our distance dependence measures 
are analogous in at least some respects to the corresponding product-moment 
correlation. By analogy, certain properties of classical correlation and vari- 
ance definitions should also hold for dCor and dVar. These properties are 
established in Theorems 3 and 4. 

Theorem 3 (Properties of dCor).

(i) If $E(|X|_p + |Y|_q) < \infty$, then $0 \le \mathcal{R} \le 1$, and
$\mathcal{R}(X,Y) = 0$ if and only if $X$ and $Y$ are independent.

(ii) $0 \le \mathcal{R}_n \le 1$.

(iii) If $\mathcal{R}_n(\mathbf{X}, \mathbf{Y}) = 1$, then there exist a
vector $a$, a nonzero real number $b$ and an orthogonal matrix $C$ such that
$\mathbf{Y} = a + b\mathbf{X}C$.

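Proof. In (i), $\mathcal{R}(X,Y)$ exists whenever $X$ and $Y$ have finite
first moments, and $X$ and $Y$ are independent if and only if the numerator

\[ \mathcal{V}^2(X,Y) = \|f_{X,Y}(t,s) - f_X(t) f_Y(s)\|^2 \]

of $\mathcal{R}^2(X,Y)$ is zero. Let $U = e^{i\langle t, X \rangle} - f_X(t)$
and $V = e^{i\langle s, Y \rangle} - f_Y(s)$. Then, as in the
Cauchy-Bunyakovsky bound preceding (2.5),

\[ \begin{aligned}
|f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2 = |E[UV]|^2
  &\le (E[|U| \, |V|])^2 \le E[|U|^2] \, E[|V|^2] \\
  &= (1 - |f_X(t)|^2)(1 - |f_Y(s)|^2).
\end{aligned} \]

Thus

\[ \int_{\mathbb{R}^{p+q}} |f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2 \, d\omega
   \le \int_{\mathbb{R}^{p+q}} \left| (1 - |f_X(t)|^2)(1 - |f_Y(s)|^2) \right|^2 d\omega; \]

hence $0 \le \mathcal{R}(X,Y) \le 1$, and (ii) follows by a similar argument.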

(iii) If $\mathcal{R}_n(\mathbf{X}, \mathbf{Y}) = 1$, then the arguments
below show that $\mathbf{X}$ and $\mathbf{Y}$ are similar almost surely, thus
the dimensions of the linear subspaces spanned by $\mathbf{X}$ and
$\mathbf{Y}$ respectively are almost surely equal. (Here similar means that
$\mathbf{Y}$ and $\epsilon \mathbf{X}$ are isometric for some
$\epsilon \ne 0$.) For simplicity we can suppose that $\mathbf{X}$ and
$\mathbf{Y}$ are in the same Euclidean space and both span $\mathbb{R}^p$.
From the Cauchy-Bunyakovsky inequality it is easy to see that
$\mathcal{R}_n(\mathbf{X}, \mathbf{Y}) = 1$ if and only if
$A_{kl} = \epsilon B_{kl}$ for some factor $\epsilon$. Suppose that
$|\epsilon| = 1$. Then

\[ |X_k - X_l|_p = |Y_k - Y_l|_q + d_k + d_l \]

for all $k, l$, for some constants $d_k, d_l$. Then with $k = l$ we obtain
$d_k = 0$ for all $k$. Now, one can apply a geometric argument. The two
samples are isometric, so $\mathbf{Y}$ can be obtained from $\mathbf{X}$
through operations of shift, rotation and reflection, and hence
$\mathbf{Y} = a + b\mathbf{X}C$ for some vector $a$, $b = \epsilon$ and
orthogonal matrix $C$. If $|\epsilon| \ne 1$ and $\epsilon \ne 0$, apply the
geometric argument to $\epsilon \mathbf{X}$ and $\mathbf{Y}$ and it follows
that $\mathbf{Y} = a + b\mathbf{X}C$ where $b = \epsilon$. □
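The converse direction of (iii) is easy to see at the sample level: if $\mathbf{Y} = a + b\mathbf{X}C$ with $C$ orthogonal and $b \ne 0$, every pairwise distance is rescaled by $|b|$, so $B_{kl} = |b| A_{kl}$ and $\mathcal{R}_n = 1$. The sketch below checks this observation, together with the bounds in (ii), on simulated data. The helper names and the simulated setup are illustrative assumptions, not code from the paper.

```python
import numpy as np

def centered_distances(Z):
    """Double-centered Euclidean distance matrix of the rows of Z."""
    Z = np.asarray(Z, dtype=float)
    a = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True) + a.mean())

def dcor(X, Y):
    """Empirical distance correlation R_n, the square root of (2.10)."""
    A, B = centered_distances(X), centered_distances(Y)
    denom = (A * A).mean() * (B * B).mean()
    if denom <= 0:
        return 0.0
    return float(np.sqrt(max((A * B).mean(), 0.0) / np.sqrt(denom)))

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
C, _ = np.linalg.qr(rng.normal(size=(p, p)))   # a random orthogonal matrix
a, b = rng.normal(size=p), -2.5

r_affine = dcor(X, a + b * X @ C)              # Y = a + b X C
r_indep = dcor(X, rng.normal(size=(n, 2)))     # unrelated sample in R^2
print(r_affine, r_indep)
```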

Theorem 4 (Properties of dVar). The following properties hold for random
vectors with finite first moments:

(i) $\mathrm{dVar}(X) = 0$ implies that $X = E[X]$, almost surely.

(ii) $\mathrm{dVar}(a + bCX) = |b| \, \mathrm{dVar}(X)$ for all constant
vectors $a$ in $\mathbb{R}^p$, scalars $b$ and $p \times p$ orthonormal
matrices $C$.

(iii) $\mathrm{dVar}(X + Y) \le \mathrm{dVar}(X) + \mathrm{dVar}(Y)$ for
independent random vectors $X$ in $\mathbb{R}^p$ and $Y$ in $\mathbb{R}^p$.



Proof. (i) If $\mathrm{dVar}(X) = 0$, then

\[ (\mathrm{dVar}(X))^2 = \int_{\mathbb{R}^p} \int_{\mathbb{R}^p}
   \frac{|f_{X,X}(t,s) - f_X(t) f_X(s)|^2}{c_p^2 \, |t|_p^{1+p} \, |s|_p^{1+p}} \, dt \, ds = 0, \]

or equivalently, $f_{X,X}(t,s) = f_X(t+s) = f_X(t) f_X(s)$ for all $t, s$.
That is, $f_X(t) = e^{i\langle c, t \rangle}$ for some constant vector $c$,
and hence $X$ is a constant vector, almost surely.

Statement (ii) is obvious.

(iii) Let $\Delta = f_{X+Y,X+Y}(t,s) - f_{X+Y}(t) f_{X+Y}(s)$,
$\Delta_1 = f_{X,X}(t,s) - f_X(t) f_X(s)$ and
$\Delta_2 = f_{Y,Y}(t,s) - f_Y(t) f_Y(s)$. If $X$ and $Y$ are independent,
then

\[ \begin{aligned}
\Delta &= f_{X+Y,X+Y}(t,s) - f_{X+Y}(t) f_{X+Y}(s) \\
  &= f_X(t+s) f_Y(t+s) - f_X(t) f_X(s) f_Y(t) f_Y(s) \\
  &= [f_X(t+s) - f_X(t) f_X(s)] f_Y(t+s) + f_X(t) f_X(s) [f_Y(t+s) - f_Y(t) f_Y(s)] \\
  &= \Delta_1 f_Y(t+s) + f_X(t) f_X(s) \Delta_2,
\end{aligned} \]

and therefore $|\Delta|^2 \le |\Delta_1|^2 + |\Delta_2|^2 + 2|\Delta_1| \, |\Delta_2|$.
Equality holds if and only if $\Delta_1 \Delta_2 = 0$, that is, if and only
if $X$ or $Y$ is a constant almost surely. □
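The empirical distance variance $\mathcal{V}_n$ satisfies the sample analogues of (i) and (ii) exactly, which makes them convenient unit tests for an implementation. A minimal sketch follows, with illustrative names and simulated data; property (iii) is a statement about population quantities and is not checked here.

```python
import numpy as np

def dvar(Z):
    """Empirical distance variance V_n(Z), the square root of (2.9)."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    a = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    A = (a - a.mean(axis=1, keepdims=True)
           - a.mean(axis=0, keepdims=True) + a.mean())
    return float(np.sqrt((A ** 2).mean()))

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
C, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # orthonormal matrix
a, b = rng.normal(size=4), -3.0

# Row k of the transformed sample is a + b * C @ X_k, i.e. a + b * X_k @ C.T.
lhs = dvar(a + b * X @ C.T)
rhs = abs(b) * dvar(X)
assert np.isclose(lhs, rhs)                    # sample version of (ii)

# A sample of identical observations has zero distance variance, as in (i).
assert dvar(np.tile(a, (10, 1))) == 0.0
```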

2.4. Asymptotic properties of $n\mathcal{V}_n^2$. Our proposed test of
independence is based on the statistic $n\mathcal{V}_n^2 / S_2$, where $S_2$
is defined in (2.16). If $E(|X|_p + |Y|_q) < \infty$, we prove that under
independence $n\mathcal{V}_n^2 / S_2$ converges in distribution to a
quadratic form

\[ Q \overset{D}{=} \sum_{j=1}^{\infty} \lambda_j Z_j^2, \tag{2.25} \]

where $Z_j$ are independent standard normal random variables, $\{\lambda_j\}$
are nonnegative constants that depend on the distribution of $(X,Y)$ and
$E[Q] = 1$. A test of independence that rejects independence for large
$n\mathcal{V}_n^2 / S_2$ is statistically consistent against all alternatives
with finite first moments.

Let $\zeta(\cdot)$ denote a complex-valued zero-mean Gaussian random process
with covariance function

\[ R(u, u_0) = \big( f_X(t - t_0) - f_X(t) \overline{f_X(t_0)} \big)
               \big( f_Y(s - s_0) - f_Y(s) \overline{f_Y(s_0)} \big), \]

where $u = (t,s)$, $u_0 = (t_0, s_0) \in \mathbb{R}^p \times \mathbb{R}^q$.

Theorem 5 (Weak convergence). If $X$ and $Y$ are independent and
$E(|X|_p + |Y|_q) < \infty$, then

\[ n\mathcal{V}_n^2 \xrightarrow[n \to \infty]{D} \|\zeta(t,s)\|^2. \]

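Proof. Define the empirical process

\[ \zeta_n(u) = \zeta_n(t,s) = \sqrt{n} \, \xi_n(t,s) = \sqrt{n} \, \big( f^n_{X,Y}(t,s) - f^n_X(t) f^n_Y(s) \big). \]

Under the independence hypothesis, $E[\zeta_n(u)] = 0$ and
$E[\zeta_n(u) \overline{\zeta_n(u_0)}] = \frac{n-1}{n} R(u, u_0)$. In
particular, $E|\zeta_n(u)|^2 = \frac{n-1}{n} (1 - |f_X(t)|^2)(1 - |f_Y(s)|^2) \le 1$.

For each $\delta > 0$ we construct a sequence of random variables
$\{Q_n(\delta)\}$ with the following properties:

(i) $Q_n(\delta)$ converges in distribution to a random variable $Q(\delta)$.

(ii) $E\big| Q_n(\delta) - \|\zeta_n\|^2 \big| \le \delta$.

(iii) $E\big| Q(\delta) - \|\zeta\|^2 \big| \le \delta$.

Then the weak convergence of $\|\zeta_n\|^2$ to $\|\zeta\|^2$ follows from
the convergence of the corresponding characteristic functions.

The sequence $Q_n(\delta)$ is defined as follows. Given $\epsilon > 0$,
choose a partition $\{D_k\}_{k=1}^N$ of $D(\delta)$ (2.20) into
$N = N(\epsilon)$ measurable sets with diameter at most $\epsilon$. Define

\[ Q_n(\delta) = \sum_{k=1}^N \int_{D_k} |\zeta_n|^2 \, d\omega. \]

For a fixed $M > 0$ let

\[ \beta(\epsilon) = \sup_{u, u_0} E\big| \, |\zeta_n(u)|^2 - |\zeta_n(u_0)|^2 \big|, \]

where the supremum is taken over all $u = (t,s)$ and $u_0 = (t_0, s_0)$ such
that $\max\{|t|, |t_0|, |s|, |s_0|\} < M$, and
$|t - t_0|^2 + |s - s_0|^2 < \epsilon^2$. Then
$\lim_{\epsilon \to 0} \beta(\epsilon) = 0$ for every fixed $M > 0$, and for
fixed $\delta > 0$

\[ E\left| \int_{D(\delta)} |\zeta_n(u)|^2 \, d\omega - Q_n(\delta) \right|
   \le \beta(\epsilon) \int_{D(\delta)} |\zeta_n(u)|^2 \, d\omega
   \xrightarrow[\epsilon \to 0]{} 0. \]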
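On the other hand,

\[ \begin{aligned}
\left| \int_{D(\delta)} |\zeta_n(u)|^2 \, d\omega - \int_{\mathbb{R}^{p+q}} |\zeta_n(u)|^2 \, d\omega \right|
  &\le \int_{|t| < \delta} |\zeta_n(u)|^2 \, d\omega + \int_{|t| > 1/\delta} |\zeta_n(u)|^2 \, d\omega \\
  &\quad + \int_{|s| < \delta} |\zeta_n(u)|^2 \, d\omega + \int_{|s| > 1/\delta} |\zeta_n(u)|^2 \, d\omega.
\end{aligned} \]

By similar steps as in the proof of Theorem 2, one can derive that

\[ \begin{aligned}
E\left[ \int_{|t| < \delta} |\zeta_n(u)|^2 \, d\omega + \int_{|t| > 1/\delta} |\zeta_n(u)|^2 \, d\omega \right]
  &\le \frac{n-1}{n} \big( E|X_1 - X_2| \, G(|X_1 - X_2| \delta) + w_p \delta \big) E|Y_1 - Y_2| \\
  &\xrightarrow[\delta \to 0]{} 0,
\end{aligned} \]

where $w_p$ is a constant depending only on $p$, and similarly

\[ E\left[ \int_{|s| < \delta} |\zeta_n(u)|^2 \, d\omega + \int_{|s| > 1/\delta} |\zeta_n(u)|^2 \, d\omega \right]
   \xrightarrow[\delta \to 0]{} 0. \]

Similar inequalities also hold for the random process $\zeta(t,s)$ with

\[ Q(\delta) = \sum_{k=1}^N \int_{D_k} |\zeta|^2 \, d\omega. \]

The weak convergence of $Q_n(\delta)$ to $Q(\delta)$ as $n \to \infty$
follows from the multivariate central limit theorem, and therefore
$n\mathcal{V}_n^2 = \|\zeta_n\|^2 \xrightarrow[n \to \infty]{D} \|\zeta\|^2$. □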



Corollary 2. If $E(|X|_p + |Y|_q) < \infty$, then:

(i) If $X$ and $Y$ are independent,
$n\mathcal{V}_n^2 / S_2 \xrightarrow[n \to \infty]{D} Q$ where $Q$ is a
nonnegative quadratic form of centered Gaussian random variables (2.25) and
$E[Q] = 1$.

(ii) If $X$ and $Y$ are dependent, then
$n\mathcal{V}_n^2 / S_2 \xrightarrow[n \to \infty]{P} \infty$.



Proof. (i) The independence of $X$ and $Y$ implies that $\zeta_n$ and thus
$\zeta$ is a zero-mean process. According to Kuo [5], Chapter 1, Section 2,
the squared norm $\|\zeta\|^2$ of the zero-mean Gaussian process $\zeta$ has
the representation

\[ \|\zeta\|^2 \overset{D}{=} \sum_{j=1}^{\infty} \lambda_j Z_j^2, \tag{2.26} \]

where $Z_j$ are independent standard normal random variables, and the
nonnegative constants $\{\lambda_j\}$ depend on the distribution of $(X,Y)$.
Hence, under independence, $n\mathcal{V}_n^2$ converges in distribution to a
quadratic form (2.26). It follows from (2.5) that

\[ \begin{aligned}
E\|\zeta\|^2 &= \int_{\mathbb{R}^{p+q}} R(u,u) \, d\omega \\
  &= \int_{\mathbb{R}^{p+q}} (1 - |f_X(t)|^2)(1 - |f_Y(s)|^2) \, d\omega \\
  &= E(|X - X'|_p \, |Y - Y'|_q).
\end{aligned} \]

By the SLLN for $V$-statistics,
$S_2 \xrightarrow[n \to \infty]{a.s.} E(|X - X'|_p \, |Y - Y'|_q)$. Therefore
$n\mathcal{V}_n^2 / S_2 \xrightarrow[n \to \infty]{D} Q$, where $E[Q] = 1$ and
$Q$ is the quadratic form (2.25).

(ii) Suppose that $X$ and $Y$ are dependent and $E(|X|_p + |Y|_q) < \infty$.
Then $\mathcal{V}(X,Y) > 0$, Theorem 2 implies that
$\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) \xrightarrow[n \to \infty]{a.s.} \mathcal{V}^2(X,Y) > 0$,
and therefore
$n\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y}) \xrightarrow[n \to \infty]{a.s.} \infty$.
By the SLLN, $S_2$ converges to a constant and therefore
$n\mathcal{V}_n^2 / S_2 \xrightarrow[n \to \infty]{a.s.} \infty$. □

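Theorem 6. Suppose $T(X,Y,\alpha,n)$ is the test that rejects independence
if

\[ \frac{n\mathcal{V}_n^2(\mathbf{X}, \mathbf{Y})}{S_2} > \big( \Phi^{-1}(1 - \alpha/2) \big)^2, \]

where $\Phi(\cdot)$ denotes the standard normal cumulative distribution
function, and let $\alpha(X,Y,n)$ denote the achieved significance level of
$T(X,Y,\alpha,n)$. If $E(|X|_p + |Y|_q) < \infty$, then for all
$0 < \alpha \le 0.215$

(i) $\lim_{n \to \infty} \alpha(X,Y,n) \le \alpha$,

(ii) $\sup_{X,Y} \{ \lim_{n \to \infty} \alpha(X,Y,n) : \mathcal{V}(X,Y) = 0 \} = \alpha$.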

Proof. (i) The following inequality is proved as a special case of a
theorem of Szekely and Bakirov [9], page 189. If $Q$ is a quadratic form of
centered Gaussian random variables and $E[Q] = 1$, then

\[ P\{ Q \ge (\Phi^{-1}(1 - \alpha/2))^2 \} \le \alpha \]

for all $0 < \alpha \le 0.215$.

(ii) For Bernoulli random variables $X$ and $Y$ we have that
$\mathcal{R}_n(\mathbf{X}, \mathbf{Y}) = |\hat\rho(\mathbf{X}, \mathbf{Y})|$.
By the central limit theorem, under independence
$\sqrt{n}\, \hat\rho(\mathbf{X}, \mathbf{Y})$ is asymptotically normal. Thus,
in case $X$ and $Y$ are independent Bernoulli variables, the quadratic form
$Q$ contains only one term, $Q = Z_1^2$, and the upper bound $\alpha$ is
achieved. □
Thus, a test rejecting independence of $X$ and $Y$ when
$\sqrt{n\mathcal{V}_n^2 / S_2} \ge \Phi^{-1}(1 - \alpha/2)$ has an asymptotic
significance level at most $\alpha$. The asymptotic test criterion could be
quite conservative for many distributions. Alternatively one can estimate the
critical value for the test by conditioning on the observed sample, which is
discussed in Section 5.
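The asymptotic criterion of Theorem 6 needs only $\mathcal{V}_n^2$, $S_2$ and a normal quantile. The following sketch is one way it might be coded; the function name, argument layout and return value are our own choices, and the resampling alternative mentioned above is not implemented here.

```python
from statistics import NormalDist

import numpy as np

def dcov_asymptotic_test(X, Y, alpha=0.05):
    """Return (statistic, critical value, reject) for the test of Theorem 6.

    Hypothetical helper: X and Y hold n observations as rows (or as 1-d
    arrays). Degenerate samples with S_2 = 0 are not handled.
    """
    if not 0 < alpha <= 0.215:
        raise ValueError("Theorem 6 covers 0 < alpha <= 0.215 only")
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    A = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    B = b - b.mean(axis=1, keepdims=True) - b.mean(axis=0, keepdims=True) + b.mean()
    n = a.shape[0]
    v2 = (A * B).mean()              # V_n^2, equation (2.8)
    s2 = a.mean() * b.mean()         # S_2, equation (2.16)
    stat = n * v2 / s2
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    return float(stat), crit, bool(stat > crit)

rng = np.random.default_rng(4)
x = rng.normal(size=400)
print(dcov_asymptotic_test(x, x ** 2))               # strongly dependent pair
print(dcov_asymptotic_test(x, rng.normal(size=400))) # independent pair
```

Because the bound in Theorem 6 holds uniformly over distributions, the rejection rate under independence can be well below the nominal level, which is the conservativeness noted above.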

Remark 3. If $E|X|_p^2 < \infty$ and $E|Y|_q^2 < \infty$, then
$E[|X|_p |Y|_q] < \infty$, so by Lemma 1 and by Fubini's theorem we can
evaluate

\[ \begin{aligned}
\mathcal{V}^2(X,Y) &= E[|X_1 - X_2|_p \, |Y_1 - Y_2|_q] + E|X_1 - X_2|_p \, E|Y_1 - Y_2|_q \\
  &\quad - 2 E[|X_1 - X_2|_p \, |Y_1 - Y_3|_q].
\end{aligned} \]

If second moments exist, Theorem 2 and weak convergence can be established
by $V$-statistic limit theorems [13]. Under the null hypothesis of
independence, $\mathcal{V}_n^2$ is a degenerate kernel $V$-statistic. The
first-order degeneracy follows from inequalities proved in [10]. Thus
$n\mathcal{V}_n^2$ converges in distribution to a quadratic form (2.26).



3. Extensions. 



3.1. The class of $\alpha$-distance dependence measures. We introduce a
one-parameter family of distance dependence measures indexed by a positive
exponent $\alpha$. In our definition of dCor we have applied exponent
$\alpha = 1$.

Suppose that $E(|X|_p^{\alpha} + |Y|_q^{\alpha}) < \infty$. Let
$\mathcal{V}^{(\alpha)}$ denote the $\alpha$-distance covariance, which is
the nonnegative number defined by

\[ \mathcal{V}^{2(\alpha)}(X,Y) = \|f_{X,Y}(t,s) - f_X(t) f_Y(s)\|_{\alpha}^2
   = \frac{1}{C(p,\alpha) C(q,\alpha)} \int_{\mathbb{R}^{p+q}}
     \frac{|f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2}{|t|_p^{\alpha+p} \, |s|_q^{\alpha+q}} \, dt \, ds. \]

Similarly, $\mathcal{R}^{(\alpha)}$ denotes $\alpha$-distance correlation,
which is the square root of

\[ \mathcal{R}^{2(\alpha)} = \frac{\mathcal{V}^{2(\alpha)}(X,Y)}
   {\sqrt{\mathcal{V}^{2(\alpha)}(X) \, \mathcal{V}^{2(\alpha)}(Y)}}, \qquad
   0 < \mathcal{V}^{2(\alpha)}(X), \ \mathcal{V}^{2(\alpha)}(Y) < \infty, \]

and $\mathcal{R}^{(\alpha)} = 0$ if
$\mathcal{V}^{2(\alpha)}(X) \, \mathcal{V}^{2(\alpha)}(Y) = 0$.

The $\alpha$-distance dependence statistics are defined by replacing the
exponent 1 with exponent $\alpha$ in the distance dependence statistics
(2.8), (2.9) and (2.10). That is, replace $a_{kl} = |X_k - X_l|_p$ with
$a_{kl} = |X_k - X_l|_p^{\alpha}$ and replace $b_{kl} = |Y_k - Y_l|_q$ with
$b_{kl} = |Y_k - Y_l|_q^{\alpha}$, $k, l = 1, \dots, n$.

Theorem 2 can be generalized for $\|\cdot\|_{\alpha}$-norms, so that almost
sure convergence of $\mathcal{V}_n^{(\alpha)} \to \mathcal{V}^{(\alpha)}$
follows if the $\alpha$-moments are finite. Similarly one can prove the weak
convergence and statistical consistency for $\alpha$ exponents,
$0 < \alpha < 2$, provided that $\alpha$ moments are finite.

The case $\alpha = 2$ leads to the counterpart of classical correlation and
covariance. In fact, if $p = q = 1$, then $\mathcal{R}^{(2)} = |\rho|$,
$\mathcal{R}_n^{(2)} = |\hat\rho|$ and
$\mathcal{V}_n^{(2)} = 2|\hat\sigma_{xy}|$, where $\hat\sigma_{xy}$ is the
maximum likelihood estimator of $\mathrm{Cov}(X,Y)$.
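The $\alpha = 2$ claim can be verified directly. For $p = q = 1$, double centering of $a_{kl} = (x_k - x_l)^2$ gives $A_{kl} = -2(x_k - \bar x)(x_l - \bar x)$, and likewise for $B_{kl}$, so $\mathcal{V}_n^{2(2)} = 4\hat\sigma_{xy}^2$. The sketch below confirms this numerically; the function names and the simulated data are illustrative assumptions.

```python
import numpy as np

def centered_alpha(z, alpha):
    """Double-centered matrix of |z_k - z_l|^alpha for a 1-d sample z."""
    z = np.asarray(z, dtype=float)
    a = np.abs(z[:, None] - z[None, :]) ** alpha
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True) + a.mean())

def dcov_alpha(x, y, alpha):
    """Empirical alpha-distance covariance V_n^(alpha) for p = q = 1."""
    v2 = (centered_alpha(x, alpha) * centered_alpha(y, alpha)).mean()
    return float(np.sqrt(max(v2, 0.0)))

def dcor_alpha(x, y, alpha):
    """Empirical alpha-distance correlation R_n^(alpha) for p = q = 1."""
    denom = dcov_alpha(x, x, alpha) * dcov_alpha(y, y, alpha)
    return dcov_alpha(x, y, alpha) / np.sqrt(denom) if denom > 0 else 0.0

rng = np.random.default_rng(6)
x = rng.normal(size=150)
y = 0.6 * x + rng.normal(size=150)

sigma_xy = np.mean((x - x.mean()) * (y - y.mean()))   # ML covariance estimate
rho_hat = np.corrcoef(x, y)[0, 1]

assert np.isclose(dcov_alpha(x, y, 2), 2 * abs(sigma_xy))
assert np.isclose(dcor_alpha(x, y, 2), abs(rho_hat))
```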

3.2. Affine invariance. Group invariance is an important concept in sta- 
tistical inference (see Eaton [3] or Giri [4]), particularly when any transfor- 
mation of data and/or parameters by some group element constitutes an 
equivalent problem for inference. For the problem of testing independence, 
which is preserved under the group of affine transformations, it is natural to 
consider dependence measures that are affine invariant. Although
$\mathcal{R}(X,Y)$ as defined by (2.7) is not affine invariant, it is clearly
invariant with respect to the group of orthogonal transformations

\[ X \mapsto a_1 + b_1 C_1 X, \qquad Y \mapsto a_2 + b_2 C_2 Y, \tag{3.1} \]

where $a_1$, $a_2$ are arbitrary vectors, $b_1$, $b_2$ are arbitrary nonzero
numbers and $C_1$, $C_2$ are arbitrary orthogonal matrices. We can also
define a distance correlation that is affine invariant.

For random samples $\mathbf{X}$ from the distribution of $X$ in
$\mathbb{R}^p$ and $\mathbf{Y}$ from the distribution of $Y$ in
$\mathbb{R}^q$, define the scaled samples $\mathbf{X}^*$ and $\mathbf{Y}^*$ by

\[ \mathbf{X}^* = \mathbf{X} S_X^{-1/2}, \qquad \mathbf{Y}^* = \mathbf{Y} S_Y^{-1/2}, \tag{3.2} \]

where $S_X$ and $S_Y$ are the sample covariance matrices of $\mathbf{X}$ and
$\mathbf{Y}$, respectively. Although the sample vectors in (3.2) are not
invariant to affine transformations, the distances $|X_k^* - X_l^*|$ and
$|Y_k^* - Y_l^*|$, $k, l = 1, \dots, n$, are invariant to affine
transformations. Then the affine distance correlation statistic
$\mathcal{R}_n^*(\mathbf{X}, \mathbf{Y})$ between random samples $\mathbf{X}$
and $\mathbf{Y}$ is the square root of

\[ \mathcal{R}_n^{*2}(\mathbf{X}, \mathbf{Y}) = \frac{\mathcal{V}_n^2(\mathbf{X}^*, \mathbf{Y}^*)}
   {\sqrt{\mathcal{V}_n^2(\mathbf{X}^*) \, \mathcal{V}_n^2(\mathbf{Y}^*)}}. \]

Properties established in Section 2 also hold for $\mathcal{V}_n^*$ and
$\mathcal{R}_n^*$, because the transformation (3.2) simply replaces the
weight function $\{ c_p c_q |t|_p^{1+p} |s|_q^{1+q} \}^{-1}$ with the weight
function $\{ c_p c_q |S_X^{1/2} t|_p^{1+p} \, |S_Y^{1/2} s|_q^{1+q} \}^{-1}$.
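The invariance claimed for $\mathcal{R}_n^*$ can be illustrated numerically. The sketch below whitens each sample with the inverse symmetric square root of its covariance matrix, computed from an eigendecomposition, and compares the statistic before and after an arbitrary invertible affine map. All names, the simulated data and the matrices `M1` and `M2` are illustrative assumptions; the code also assumes $n$ exceeds the dimension so that the sample covariance matrices are nonsingular.

```python
import numpy as np

def dcor(X, Y):
    """Empirical distance correlation R_n of two samples given as rows."""
    def center(Z):
        a = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        return (a - a.mean(axis=1, keepdims=True)
                  - a.mean(axis=0, keepdims=True) + a.mean())
    A, B = center(X), center(Y)
    denom = (A * A).mean() * (B * B).mean()
    if denom <= 0:
        return 0.0
    return float(np.sqrt(max((A * B).mean(), 0.0) / np.sqrt(denom)))

def standardize(Z):
    """Scaled sample Z* = Z S^{-1/2} of (3.2); S is the sample covariance."""
    Z = np.asarray(Z, dtype=float)
    S = np.atleast_2d(np.cov(Z, rowvar=False))
    w, U = np.linalg.eigh(S)             # S assumed symmetric positive definite
    return Z @ (U @ np.diag(w ** -0.5) @ U.T)

def affine_dcor(X, Y):
    """Affine-invariant statistic R_n* computed from the scaled samples."""
    return dcor(standardize(X), standardize(Y))

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 2))
Y = np.column_stack([X[:, 0] ** 2, X[:, 0] * X[:, 1], rng.normal(size=60)])

M1 = np.array([[2.0, 1.0], [0.5, 3.0]])                          # invertible
M2 = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [4.0, 0.0, 1.0]])

r_before = affine_dcor(X, Y)
r_after = affine_dcor(X @ M1 + np.array([1.0, -2.0]),
                      Y @ M2 + np.array([0.0, 5.0, -1.0]))
assert abs(r_before - r_after) < 1e-8
```

The unscaled statistic `dcor(X, Y)` generally changes under the same maps, since it is invariant only under the transformations (3.1).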

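4. Results for the bivariate normal distribution. Let $X$ and $Y$ have
standard normal distributions with $\mathrm{Cov}(X,Y) = \rho(X,Y) = \rho$.
Introduce the function

\[ F(\rho) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
   |f_{X,Y}(t,s) - f_X(t) f_Y(s)|^2 \, \frac{dt}{t^2} \frac{ds}{s^2}. \]

Then $\mathcal{V}^2(X,Y) = F(\rho)/c_1^2 = F(\rho)/\pi^2$ and

\[ \mathcal{R}^2(X,Y) = \frac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X,X) \, \mathcal{V}^2(Y,Y)}}
   = \frac{F(\rho)}{F(1)}. \tag{4.1} \]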
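Theorem 7. If $X$ and $Y$ are standard normal, with correlation
$\rho = \rho(X,Y)$, then

(i) $\mathcal{R}(X,Y) \le |\rho|$,

(ii) $\mathcal{R}^2(X,Y) = \dfrac{\rho \arcsin\rho + \sqrt{1 - \rho^2} - \rho \arcsin(\rho/2) - \sqrt{4 - \rho^2} + 1}{1 + \pi/3 - \sqrt{3}}$,

(iii) $\inf_{\rho \ne 0} \dfrac{\mathcal{R}(X,Y)}{|\rho|} = \lim_{\rho \to 0} \dfrac{\mathcal{R}(X,Y)}{|\rho|} = \dfrac{1}{2(1 + \pi/3 - \sqrt{3})^{1/2}} \cong 0.89066$.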



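Proof. (i) If $X$ and $Y$ are standard normal with correlation $\rho$, then

\[ \begin{aligned}
F(\rho) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
           \left| e^{-(t^2 + s^2)/2 - \rho t s} - e^{-t^2/2} e^{-s^2/2} \right|^2 \frac{dt}{t^2} \frac{ds}{s^2} \\
  &= \int_{\mathbb{R}^2} e^{-t^2 - s^2} (1 - 2 e^{-\rho t s} + e^{-2 \rho t s}) \, \frac{dt}{t^2} \frac{ds}{s^2} \\
  &= \int_{\mathbb{R}^2} e^{-t^2 - s^2} \sum_{n=2}^{\infty} \frac{2^n - 2}{n!} (-\rho t s)^n \, \frac{dt}{t^2} \frac{ds}{s^2} \\
  &= \int_{\mathbb{R}^2} e^{-t^2 - s^2} \sum_{k=1}^{\infty} \frac{2^{2k} - 2}{(2k)!} (\rho t s)^{2k} \, \frac{dt}{t^2} \frac{ds}{s^2} \\
  &= \rho^2 \left[ \sum_{k=1}^{\infty} \frac{2^{2k} - 2}{(2k)!} \rho^{2(k-1)}
       \int_{\mathbb{R}^2} e^{-t^2 - s^2} (t s)^{2(k-1)} \, dt \, ds \right].
\end{aligned} \]

Thus $F(\rho) = \rho^2 G(\rho)$, where $G(\rho)$ is a sum with all
nonnegative terms. The function $G(\rho)$ is clearly nondecreasing in $\rho$
and $G(\rho) \le G(1)$. Therefore

\[ \mathcal{R}^2(X,Y) = \frac{F(\rho)}{F(1)} = \rho^2 \, \frac{G(\rho)}{G(1)} \le \rho^2, \]

or equivalently, $\mathcal{R}(X,Y) \le |\rho|$.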

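(ii) Note that $F(0) = F'(0) = 0$ so
$F(\rho) = \int_0^{\rho} \int_0^{x} F''(z) \, dz \, dx$. The second derivative
of $F$ is

\[ F''(z) = \frac{d^2}{dz^2} \int_{\mathbb{R}^2} e^{-t^2 - s^2} (1 - 2 e^{-z t s} + e^{-2 z t s}) \, \frac{dt}{t^2} \frac{ds}{s^2}
   = 4V(z) - 2V(z/2), \]

where

\[ V(z) = \int_{\mathbb{R}^2} e^{-t^2 - s^2 - 2 z t s} \, dt \, ds = \frac{\pi}{\sqrt{1 - z^2}}. \]

Here we have applied a change of variables, used the fact that the eigenvalues
of the quadratic form $t^2 + s^2 + 2zts$ are $1 \pm z$, and
$\int_{-\infty}^{\infty} e^{-t^2 \lambda} \, dt = (\pi/\lambda)^{1/2}$. Then

\[ \begin{aligned}
F(\rho) &= \int_0^{\rho} \int_0^{x} \left( \frac{4\pi}{\sqrt{1 - z^2}} - \frac{2\pi}{\sqrt{1 - z^2/4}} \right) dz \, dx \\
  &= 4\pi \int_0^{\rho} (\arcsin(x) - \arcsin(x/2)) \, dx \\
  &= 4\pi \left( \rho \arcsin\rho + \sqrt{1 - \rho^2} - \rho \arcsin(\rho/2) - \sqrt{4 - \rho^2} + 1 \right),
\end{aligned} \]

and (4.1) implies (ii).

(iii) In the proof of (i) we have that $\mathcal{R}/|\rho|$ is a
nondecreasing function of $|\rho|$, and
$\lim_{|\rho| \to 0} \mathcal{R}(X,Y)/|\rho| = (1 + \pi/3 - \sqrt{3})^{-1/2}/2$
follows from (ii). □

The relation between $\mathcal{R}$ and $\rho$ derived in Theorem 7 is shown
by the plot of $\mathcal{R}^2$ versus $\rho^2$ in Figure 1.

Fig. 1. Dependence coefficient $\mathcal{R}^2$ (solid line) and correlation
$\rho^2$ (dashed line) in the bivariate normal case.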
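The closed form in Theorem 7(ii) is simple to evaluate, which gives a quick check of parts (i) and (iii) and a way to reproduce the curve described for Figure 1. A minimal sketch follows; the function name `dcor2_bivariate_normal` is ours.

```python
import math

def dcor2_bivariate_normal(rho):
    """R^2(X, Y) for standard bivariate normal (X, Y), Theorem 7(ii)."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    num = (rho * math.asin(rho) + math.sqrt(1 - rho ** 2)
           - rho * math.asin(rho / 2) - math.sqrt(4 - rho ** 2) + 1)
    den = 1 + math.pi / 3 - math.sqrt(3)
    return num / den

# Part (i): R <= |rho| on a grid, with equality at rho = +-1.
for i in range(1, 101):
    rho = i / 100
    assert math.sqrt(dcor2_bivariate_normal(rho)) <= rho + 1e-12
assert math.isclose(dcor2_bivariate_normal(1.0), 1.0)

# Part (iii): the ratio R / |rho| approaches 0.89066 as rho tends to 0.
small = 1e-3
ratio = math.sqrt(dcor2_bivariate_normal(small)) / small
limit = 1 / (2 * math.sqrt(1 + math.pi / 3 - math.sqrt(3)))
print(ratio, limit)
```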

5. Empirical results. In this section we summarize Monte Carlo power
comparisons of our proposed distance covariance test with three classical
tests for multivariate independence. The likelihood ratio test (LRT) of the
hypothesis $H_0 : \Sigma_{12} = 0$, with $\mu$ unknown, is based on

$$\frac{\det(S)}{\det(S_{11})\det(S_{22})} = \frac{\det(S_{22} - S_{21} S_{11}^{-1} S_{12})}{\det(S_{22})}, \tag{5.1}$$

where det(·) is the determinant, $S$, $S_{11}$ and $S_{22}$ denote the sample
covariances of (X,Y), X and Y, respectively, and $S_{12}$ is the sample
covariance Cov(X, Y). Under multivariate normality,

$$W = -2\log\Lambda = -n \log\det(I - S_{22}^{-1} S_{21} S_{11}^{-1} S_{12})$$

Table 1

Empirical Type-I error rates for 10,000 tests at nominal significance level 0.1
in Example 1 ($\rho = 0$), using B replicates for V and Bartlett's approximation
for W, T and S

              (a) Multivariate normal, p = q = 5     (b) t(1), p = q = 5
   n     B      V       W       T       S              V       W       T       S
  25   400   0.1039  0.1089  0.1212  0.1121         0.1010  0.3148  0.1137  0.1120
  30   366   0.0992  0.0987  0.1145  0.1049         0.0984  0.3097  0.1102  0.1078
  35   342   0.0977  0.1038  0.1091  0.1011         0.1060  0.3102  0.1087  0.1054
  50   300   0.0990  0.0953  0.1052  0.1011         0.1036  0.2904  0.1072  0.1037
  70   271   0.1001  0.0983  0.1031  0.1004         0.1000  0.2662  0.1013  0.0980
 100   250   0.1019  0.0954  0.0985  0.0972         0.1025  0.2433  0.0974  0.1019

              (c) t(2), p = q = 5                    (d) t(3), p = q = 5
   n     B      V       W       T       S              V       W       T       S
  25   400   0.1037  0.1612  0.1217  0.1203         0.1001  0.1220  0.1174  0.1134
  30   366   0.0959  0.1636  0.1070  0.1060         0.1017  0.1224  0.1143  0.1062
  35   342   0.0998  0.1618  0.1080  0.1073         0.1074  0.1213  0.1131  0.1047
  50   300   0.1033  0.1639  0.1010  0.0969         0.1050  0.1166  0.1065  0.1046
  70   271   0.1029  0.1590  0.1063  0.0994         0.1037  0.1200  0.1020  0.1002
 100   250   0.0985  0.1560  0.1050  0.1007         0.1019  0.1176  0.1066  0.1033

has the Wilks Lambda distribution $\Lambda(q, n - 1 - p, p)$ [14]. Puri and
Sen [8], Chapter 8, proposed similar tests based on more general sample
dispersion matrices $T = (T_{ij})$. The Puri-Sen tests replace $S$, $S_{11}$,
$S_{12}$ and $S_{22}$ in (5.1) with $T$, $T_{11}$, $T_{12}$ and $T_{22}$. For
example, T can be a matrix of Spearman's rank correlation statistics. For a
sign test the dispersion matrix has entries
$\frac{1}{n} \sum_{j=1}^{n} \operatorname{sign}(Z_{jk} - \tilde{Z}_k) \operatorname{sign}(Z_{jm} - \tilde{Z}_m)$,
where $\tilde{Z}_k$ is the sample median of the kth variable. Critical values of
the Wilks Lambda and Puri-Sen statistics are given by Bartlett's
approximation: if n is large and p, q > 2, then

$$-\left(n - \tfrac{1}{2}(p + q + 3)\right) \log\det(I - S_{22}^{-1} S_{21} S_{11}^{-1} S_{12})$$

has an approximate $\chi^2(pq)$ distribution ([6], Section 5.3.2b).
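
The Bartlett-corrected statistic can be sketched in a few lines. The version
below is only an illustration, not the authors' code: it uses the determinant
ratio in (5.1), which equals $\det(I - S_{22}^{-1} S_{21} S_{11}^{-1} S_{12})$, so
no matrix inverse is needed. The helper names (sample_cov, log_abs_det,
bartlett_statistic) are hypothetical, and the function returns the statistic
together with the degrees of freedom pq, leaving the comparison with a
chi-square critical value to the reader.

```python
import math
import random


def sample_cov(rows):
    """Sample covariance matrix (divisor n - 1) of a list of observation rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def log_abs_det(mat):
    """log|det(mat)| by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in mat]
    d = len(a)
    total = 0.0
    for c in range(d):
        piv = max(range(c, d), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        total += math.log(abs(a[c][c]))  # raises ValueError if mat is singular
        for r in range(c + 1, d):
            f = a[r][c] / a[c][c]
            for k in range(c, d):
                a[r][k] -= f * a[c][k]
    return total


def bartlett_statistic(x_rows, y_rows):
    """Return (statistic, degrees of freedom) for Bartlett's approximation."""
    n, p, q = len(x_rows), len(x_rows[0]), len(y_rows[0])
    s = sample_cov([list(x) + list(y) for x, y in zip(x_rows, y_rows)])
    s11 = [row[:p] for row in s[:p]]
    s22 = [row[p:] for row in s[p:]]
    # log of det(S) / (det(S11) det(S22)), the ratio in (5.1)
    log_ratio = log_abs_det(s) - log_abs_det(s11) - log_abs_det(s22)
    return -(n - 0.5 * (p + q + 3)) * log_ratio, p * q


if __name__ == "__main__":
    rng = random.Random(0)
    x = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(80)]
    y = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(80)]
    stat, df = bartlett_statistic(x, y)
    print(f"statistic = {stat:.3f}, to be compared with chi-square({df})")
```

The ratio of determinants is unchanged by the choice of divisor (n or n - 1) in
the covariance estimate, so either convention gives the same statistic.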

To implement the distance covariance test for small samples, we obtain
a reference distribution for $n\mathcal{V}_n^2$ under independence by
conditioning on the observed sample, that is, by computing replicates of
$n\mathcal{V}_n^2$ under random permutations of the indices of the Y sample. We
obtain good control of Type-I error (see Table 1) with a small number of
replicates; for this study we used $\lfloor 200 + 5000/n \rfloor$ replicates.
Implementation is straightforward due to the simple form of the statistic. The
statistic $n\mathcal{V}_n^2$ has $O(n^2)$ time and space computational
complexity. Source code for the test implementation is available from the
authors upon request.
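
The permutation procedure described above can be sketched as follows. This is
an illustrative pure-Python version, not the authors' implementation. It
assumes the double-centered distance matrices $A_{kl}$, $B_{kl}$ of definition
(2.8), computes $n\mathcal{V}_n^2 = \frac{1}{n} \sum_{k,l} A_{kl} B_{kl}$, and
uses $\lfloor 200 + 5000/n \rfloor$ replicates. The function names and the
(1 + count)/(1 + replicates) p-value convention are assumptions made for the
example.

```python
import math
import random


def centered_distances(rows):
    """Double-centered Euclidean distance matrix of a list of sample points."""
    n = len(rows)
    d = [[math.dist(u, v) for v in rows] for u in rows]
    row_mean = [sum(r) / n for r in d]   # equals column means by symmetry
    grand = sum(row_mean) / n
    return [[d[k][l] - row_mean[k] - row_mean[l] + grand for l in range(n)]
            for k in range(n)]


def n_dcov_sq(a, b, perm):
    """n * V_n^2 with the Y sample indices permuted by perm."""
    n = len(a)
    return sum(a[k][l] * b[perm[k]][perm[l]]
               for k in range(n) for l in range(n)) / n


def dcov_perm_test(x_rows, y_rows, rng=None):
    """Return (observed n*V_n^2, permutation p-value)."""
    rng = rng if rng is not None else random.Random()
    n = len(x_rows)
    a = centered_distances(x_rows)
    b = centered_distances(y_rows)
    identity = list(range(n))
    observed = n_dcov_sq(a, b, identity)
    n_rep = math.floor(200 + 5000 / n)   # replicate count used in the study
    exceed = 0
    for _ in range(n_rep):
        perm = identity[:]
        rng.shuffle(perm)
        if n_dcov_sq(a, b, perm) >= observed:
            exceed += 1
    # Assumed convention: count the observed statistic as one replicate.
    return observed, (1 + exceed) / (1 + n_rep)


if __name__ == "__main__":
    rng = random.Random(1)
    x = [[rng.gauss(0, 1)] for _ in range(40)]
    y = [[xi[0] ** 2] for xi in x]       # nonlinear, uncorrelated dependence
    stat, p_value = dcov_perm_test(x, y, rng)
    print(f"n*V_n^2 = {stat:.4f}, permutation p-value = {p_value:.4f}")
```

Permuting the indices of the already centered matrix B is equivalent to
recentering the permuted Y sample, so each replicate costs one $O(n^2)$ pass.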

Each example compares the empirical power of the dCov test (labeled V)
with the Wilks Lambda statistic (W), Puri-Sen rank correlation statistic (S)
and Puri-Sen sign statistic (T). Empirical power is computed as the proportion
of significant tests on 10,000 random samples at significance level 0.1.

Fig. 2. Example 1(a): Empirical power at 0.1 significance and sample size n,
multivariate normal alternative.

Example 1. In 1(a) the marginal distributions of X and Y are standard
multivariate normal in dimensions p = q = 5 and $\mathrm{Cov}(X_k, Y_l) = \rho$
for k, l = 1, ..., 5. The results displayed in Figure 2 are based on 10,000
tests for each of the sample sizes n = 25:50:1, 55:100:5, 110:200:10 with
$\rho = 0.1$. As expected, the Wilks LRT is optimal in this case, but power of
the dCov test is quite close to W. Table 1 gives empirical Type-I error rates
for this example when $\rho = 0$.

In Examples 1(b)-1(d) we repeat 1(a) under identical conditions except
that the random variables $X_k$ and $Y_l$ are generated from the $t(\nu)$
distribution. Table 1 gives empirical Type-I error rates for $\nu = 1, 2, 3$
when $\rho = 0$. Empirical power for the alternative $\rho = 0.1$ is compared
in Figures 3-5. (The Wilks LRT has inflated Type-I error for $\nu = 1, 2, 3$,
so a power comparison with W is not meaningful, particularly for
$\nu = 1, 2$.)

Example 2. The distribution of X is standard multivariate normal
(p = 5), and $Y_{kj} = X_{kj}\varepsilon_{kj}$, j = 1, ..., p, where
$\varepsilon_{kj}$ are independent standard normal variables and independent of
X. Comparisons are based on 10,000 tests for each of the sample sizes
n = 25:50:1, 55:100:5, 110:240:10. The results displayed in Figure 6 show that
the dCov test is clearly superior to the LRT tests. This alternative is an
example where the rank correlation and sign tests do not exhibit power
increasing with sample size.

Fig. 6. Example 2: Empirical power at 0.1 significance and sample size n
against the alternative $Y = X\varepsilon$.

Example 3. The distribution of X is standard multivariate normal
(p = 5), and $Y_{kj} = \log(X_{kj}^2)$, j = 1, ..., p. Comparisons are based on
10,000 tests for each of the sample sizes n = 25:50:1, 55:100:5. Simulation
results are displayed in Figure 7. This is an example of a nonlinear relation
where $n\mathcal{V}_n^2$ achieves very good power while none of the LRT type
tests performs well.

Fig. 7. Example 3: Empirical power at 0.1 significance and sample size n
against the alternative $Y = \log(X^2)$.
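
To make the alternatives of Examples 2 and 3 concrete, the sketch below
generates samples with $Y_{kj} = X_{kj}\varepsilon_{kj}$ and
$Y_{kj} = \log(X_{kj}^2)$ and prints the sample Pearson correlation of one
coordinate pair. The generator and helper names are hypothetical, and the
correlation check is an added illustration rather than part of the study: it
only shows that these dependent pairs are nearly uncorrelated, which is
consistent with the weak performance of covariance-based tests reported above.

```python
import math
import random


def example2(n, p=5, rng=None):
    """X standard normal in dimension p; Y_kj = X_kj * eps_kj (Example 2)."""
    rng = rng if rng is not None else random.Random()
    x = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [[xkj * rng.gauss(0, 1) for xkj in row] for row in x]
    return x, y


def example3(n, p=5, rng=None):
    """X standard normal in dimension p; Y_kj = log(X_kj^2) (Example 3)."""
    rng = rng if rng is not None else random.Random()
    x = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [[math.log(xkj ** 2) for xkj in row] for row in x]
    return x, y


def pearson(u, v):
    """Sample product-moment correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


if __name__ == "__main__":
    rng = random.Random(3)
    for name, gen in (("Example 2", example2), ("Example 3", example3)):
        x, y = gen(2000, rng=rng)
        r = pearson([row[0] for row in x], [row[0] for row in y])
        print(f"{name}: sample correlation of (X_1, Y_1) = {r:+.3f}")
```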

6. Summary. We have introduced new distance measures of dependence
dCov, analogous to covariance, and dCor, analogous to correlation, defined
for all random vectors with finite first moments. The dCov test of
multivariate independence based on the statistic $n\mathcal{V}_n^2$ is
statistically consistent against all dependent alternatives with finite
expectation. Empirical results suggest that the dCov test may be more powerful
than the parametric LRT when the dependence structure is nonlinear, while in
the multivariate normal case the dCov test was quite close in power to the
likelihood ratio test. Our proposed statistics are sensitive to all types of
departures from independence, including nonlinear or nonmonotone dependence
structure.



APPENDIX 



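Proof of statement (2.18). By definition (2.8)

$$
\begin{aligned}
&- \sum_{l} \bar{a}_{\cdot l} b_{\cdot l} + \sum_{k,l} \bar{a}_{\cdot l} \bar{b}_{k\cdot} + n \sum_{l} \bar{a}_{\cdot l} \bar{b}_{\cdot l} - \sum_{l} a_{\cdot l} \bar{b}_{\cdot\cdot} \\
&+ \bar{a}_{\cdot\cdot} b_{\cdot\cdot} - n \sum_{k} \bar{a}_{\cdot\cdot} \bar{b}_{k\cdot} - n \sum_{l} \bar{a}_{\cdot\cdot} \bar{b}_{\cdot l} + n^2 \bar{a}_{\cdot\cdot} \bar{b}_{\cdot\cdot},
\end{aligned}
$$

where $a_{k\cdot} = n\bar{a}_{k\cdot}$, $a_{\cdot l} = n\bar{a}_{\cdot l}$,
$b_{k\cdot} = n\bar{b}_{k\cdot}$ and $b_{\cdot l} = n\bar{b}_{\cdot l}$. Applying
the identities

$$S_1 = \frac{1}{n^2} \sum_{k,l=1}^{n} a_{kl} b_{kl}, \tag{A.2}$$

$$S_2 = \frac{1}{n^2} \sum_{k,l=1}^{n} a_{kl} \, \frac{1}{n^2} \sum_{k,l=1}^{n} b_{kl} = \bar{a}_{\cdot\cdot} \bar{b}_{\cdot\cdot}, \tag{A.3}$$

$$n^2 S_2 = n^2 \bar{a}_{\cdot\cdot} \bar{b}_{\cdot\cdot} = \frac{a_{\cdot\cdot}}{n} \sum_{l=1}^{n} \frac{b_{\cdot l}}{n} = \sum_{k,l=1}^{n} \bar{a}_{k\cdot} \bar{b}_{\cdot l}, \tag{A.4}$$

$$S_3 = \frac{1}{n^3} \sum_{k=1}^{n} \sum_{l,m=1}^{n} a_{kl} b_{km} = \frac{1}{n^3} \sum_{k=1}^{n} a_{k\cdot} b_{k\cdot} = \frac{1}{n} \sum_{k=1}^{n} \bar{a}_{k\cdot} \bar{b}_{k\cdot} \tag{A.5}$$

to (A.1), we obtain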

$$
\begin{aligned}
n^2 \mathcal{V}_n^2 &= n^2 S_1 - n^2 S_3 - n^2 S_3 + n^2 S_2 \\
&\quad - n^2 S_3 + n^2 S_3 + n^2 S_2 - n^2 S_2 \\
&\quad - n^2 S_3 + n^2 S_2 + n^2 S_3 - n^2 S_2 \\
&\quad + n^2 S_2 - n^2 S_2 - n^2 S_2 + n^2 S_2 = n^2 (S_1 + S_2 - 2 S_3). \qquad \square
\end{aligned}
$$
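
The identity $\mathcal{V}_n^2 = S_1 + S_2 - 2S_3$ can also be checked
numerically. The sketch below is illustrative only: it computes
$\mathcal{V}_n^2$ from the double-centered distances
$A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}$
(assumed here to be the form used in definition (2.8)) and compares it with
the right-hand side built from (A.2), (A.3) and (A.5). The function name is
hypothetical.

```python
import math
import random


def dcov_sq_two_ways(x_rows, y_rows):
    """Return (V_n^2 from centered distances, S1 + S2 - 2*S3)."""
    n = len(x_rows)
    a = [[math.dist(u, v) for v in x_rows] for u in x_rows]
    b = [[math.dist(u, v) for v in y_rows] for u in y_rows]
    a_row = [sum(r) / n for r in a]      # row means (= column means)
    b_row = [sum(r) / n for r in b]
    a_bar = sum(a_row) / n               # grand means
    b_bar = sum(b_row) / n

    v2 = sum((a[k][l] - a_row[k] - a_row[l] + a_bar)
             * (b[k][l] - b_row[k] - b_row[l] + b_bar)
             for k in range(n) for l in range(n)) / n ** 2

    s1 = sum(a[k][l] * b[k][l] for k in range(n) for l in range(n)) / n ** 2
    s2 = a_bar * b_bar
    s3 = sum(a_row[k] * b_row[k] for k in range(n)) / n   # last form in (A.5)
    return v2, s1 + s2 - 2.0 * s3


if __name__ == "__main__":
    rng = random.Random(7)
    x = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(25)]
    y = [[rng.gauss(0, 1) + xi[0] for _ in range(2)] for xi in x]
    lhs, rhs = dcov_sq_two_ways(x, y)
    print(f"centered form: {lhs:.12f}")
    print(f"S1 + S2 - 2*S3: {rhs:.12f}")
```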
referee for many suggestions that greatly improved the paper.